sual emphasis on complex substructures within the network
to highlight possible ambiguities and errors.

Method We applied the new NETWORK graphical user
interface, available via EMPOP (European DNA Profiling
Group Mitochondrial DNA Population Database;
www.empop.org), to two mtDNA data sets that were
submitted for quality control.

Results The quasi-median network torsi of the two data
sets showed complex reticulations, suggesting ambiguous
data. To check the corresponding raw data, the accountable
nodes and connecting branches of the network could be
identified by highlighting induced subgraphs with
concurrent dimming of their complements. This is achieved
by accentuating the relevant substructures in the network:
clicking on a node displays a list of all mtDNA haplotypes
included in that node; selecting a branch specifies the
mutation(s) connecting two nodes. These mutations should
then be evaluated against the raw data.

Conclusion Inspection of the raw data confirmed the
presence of phantom mutations due to suboptimal
electrophoresis conditions and data misinterpretation. The
network software proved to be a powerful tool to highlight
problematic data and guide quality control of mtDNA data
tables.
It has been observed that the generation of mitochondrial
(mt)DNA (population) data is prone to error (1-4). A
valuable tool for the graphical representation of mtDNA data
is quasi-median network (QMN) construction of reduced and
filtered haplotypes (1). Clerical errors, sequencing
artifacts, and other ambiguous data may induce character
conflicts that increase the complexity of the network,
pinpointing initial points of action for quality control of
mtDNA data sets (1-4). This tool is provided via the EMPOP
database, a collaborative project for the provision of
high-quality mtDNA population data for forensic purposes,
which was initiated by the European DNA Profiling Group
(EDNAP; http://www.isfg.org/ednap) in 1999. The acronym
stands for "EDNAP mtDNA population database" and, despite
its primary purpose of providing reliable frequency
estimates, the website (www.empop.org) has regularly been
used for quality control (QC) of published and newly
submitted population data (3,4). QMNs form one part of the
QC concept performed by EMPOP when mtDNA population data
are submitted for publication in Forensic Science
International Genetics (5) and International Journal of
Legal Medicine (6) and thus contribute to the quality
improvement of published mtDNA data sets. Also, all
haplotypes presented in the mtDNA database EMPOP (3)
undergo rigorous quality control prior to upload. This
procedure has proven to be successful in detecting errors in
individual data sets and collaborative exercises (4,7,8).

While the calculation and the drawing of QMNs are supported
by software (NETWORK) freely accessible via the EMPOP
website, their successful interpretation and evaluation
depend on the experience of the user. Users have brought
to our attention that QMNs generated by NETWORK are
sometimes too complex and fraught with reticulations,
rendering the identification of potential errors difficult.
In particular, this concerned data sets with large sample
sizes (>500) as well as data harboring haplotypes from
distant phylogenies (eg, South American populations
including haplogroup L, M, and N lineages).

In this study, we describe the application of a new
graphical user interface (GUI) of the NETWORK tool that
makes it possible to visually highlight selected structures
within the graph for a better distinction of reticulations
in complex areas (9). Further, haplotypes are now directly
linked to the graphical representation of the nodes and can
be conveniently examined to identify potential errors such
as phantom mutations, clerical errors, violation of
alignment rules, and artificial recombination. The
performance and features of the new GUI are demonstrated
using two example data sets submitted to EMPOP QC.



MATERIAL AND METHODS 

The study took place at the Institute of Legal Medicine,
Innsbruck Medical University, during summer 2012. The
application of the new NETWORK GUI was demonstrated using
two mtDNA population data sets that were submitted for
QC. The data sets are kept anonymous; they comprised 320
mtDNA haplotypes from West Eurasia (data set A) and 230
haplotypes from East Asia (data set B). QMN analysis was
conducted using EMPOP NETWORK as outlined earlier (4).
The removal of rapidly evolving mutations is critical for
the readability of QMNs. The user can choose between
different types of filters depending on the application
(4). Here, the data sets were filtered with EMPOPall_R11,
which removed all mutations observed and documented by raw
lane data in that respective EMPOP release (Release 11)
(3). Thus, only newly observed differences to the revised
Cambridge Reference Sequence (10) remained in the network,
which provides a first overview of the data quality.
Authors were contacted after EMPOP QC and asked to submit
raw data of the haplotypes in question to evaluate the QMN
findings.

RESULTS 

MtDNA population data sets, as well as individual mtDNA
haplotypes, can be quality controlled using the freely
accessible EMPOP NETWORK tool. This procedure involves
two consecutive steps. First, all rCRS-coded haplotypes
undergo plausibility checks, eg, with regard to sequence
range violation (eg, T489C in a defined range of 73-340),
reference bias (eg, A263A), double specification of
mutations, and wrong notation of insertions and deletions.
We have demonstrated earlier that many errors are already
unmasked at this stage (4). Second, quasi-medians are
calculated based on the settings selected by the user. QMN
analysis involves the application of filters to remove
highly recurrent mutations (EMPOPspeedy) that would
otherwise lead to complex structures in the network and
reduce its readability. The most comprehensive filter
includes all documented differences to the rCRS (EMPOPall)
(3), which reduces the complexity of the network to new
observations. We recommend using this filter as a first
indicator of the quality of an mtDNA data set, as it
provides a first overview of previously unobserved
mutations. The current EMPOP release (R11) holds 1694
documented differences at 1073 positions within the control
region and includes a large portion of known lineages.
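The plausibility checks of the first step can be illustrated with a minimal sketch. It is not the EMPOP implementation: the function names, the regular expression, and the problem labels are hypothetical, and only simple substitutions written as reference base, position, variant base (eg, T489C) are handled. Insertions, deletions, and heteroplasmy codes are deliberately left out.

```python
import re

# Hypothetical pattern for simple substitutions such as "T489C".
# Insertions, deletions, and IUPAC codes are out of scope for this sketch.
SUBSTITUTION = re.compile(r"^([ACGT])(\d+)([ACGT])$")


def in_range(pos, ranges):
    """Return True if pos lies inside any inclusive (start, end) range."""
    return any(start <= pos <= end for start, end in ranges)


def check_haplotype(mutations, ranges):
    """Return (mutation, problem) pairs for one rCRS-coded haplotype."""
    problems = []
    seen_positions = set()
    for mut in mutations:
        match = SUBSTITUTION.match(mut)
        if match is None:
            problems.append((mut, "unrecognized notation"))
            continue
        ref, pos, var = match.group(1), int(match.group(2)), match.group(3)
        if not in_range(pos, ranges):
            problems.append((mut, "outside reported sequence range"))
        if ref == var:
            # eg, A263A: the "variant" merely repeats the reference base
            problems.append((mut, "reference bias"))
        if pos in seen_positions:
            problems.append((mut, "double specification"))
        seen_positions.add(pos)
    return problems
```

A range that wraps around the origin of the circular genome, such as 16024-576, would be passed as two intervals, eg, `[(16024, 16569), (1, 576)]`.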

Application of the EMPOPall filter to data already included
in EMPOP results in a QMN consisting of a single node, as
all annotated differences to the rCRS in that data set are
filtered (Figure 1A). MtDNA data from already sampled
populations (eg, West Eurasian populations) that were
generated under forensic guidelines (11-13) typically
result in simple QMN torsi after passage through the
EMPOPall filter (Figure 1B), as only few novel differences
to the rCRS are observed. These can then be evaluated using
the raw lane data (Supplementary Figure 1), which in this
case confirmed all observations. We note that new lineages
are continuously observed, especially in cases where remote
populations were sampled. These then leave their haplotypic
signatures in the QMNs.
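The effect of the filter on the number of nodes can be sketched as follows. This illustrates the filtering and collapsing step only, not the quasi-median calculation itself. The function name and the toy mutation lists in the usage note are hypothetical and do not reflect the actual content of the EMPOPall filter.

```python
from collections import defaultdict


def reduce_haplotypes(haplotypes, documented):
    """Group sample ids by the mutations that survive the filter.

    haplotypes: dict mapping sample id -> iterable of mutation strings
    documented: set of mutation strings removed by the filter
    """
    nodes = defaultdict(list)
    for sample_id, mutations in haplotypes.items():
        # Only differences not covered by the filter remain in the node label.
        residual = frozenset(mutations) - documented
        nodes[residual].append(sample_id)
    return dict(nodes)
```

If every mutation of every sample is contained in `documented`, all samples fall into the single node labeled by the empty set, which corresponds to the collapse seen for data already included in EMPOP. Each novel difference adds a node that is separated from this central node by the corresponding branch.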



[Figure 1A, on-screen header of the network view: Type: Network; Filter: Hotspots_EMPOPall_R11; Range: 16024-576; Active mode: draw mode]




FIGURE 1. Quasi-median networks (QMNs) generated from (A)
868 haplotypes from Minnesota already included in EMPOP
(accession numbers EMP00402-EMP00406) and from (B) 201
haplotypes from Jordan (submitted to EMPOP for quality
control). For both data sets the EMPOPall_R11 filter
including all differences to the rCRS observed in EMPOP
Release 11 was applied. Thus, data sets already included in
EMPOP collapse into a single node (A). Data sets not yet
included in EMPOP produce structures that are reduced to the
newly observed differences to the rCRS (B). Here the QMN
shows a simple star-like structure displaying five
polymorphisms not yet observed in EMPOP. The branch labeled
"H16214S" for example represents a point heteroplasmy at
position 16214 in haplotype h5 with haplogroup status D4I.
This observation was confirmed by the raw data
(Supplementary Figure 1).



QMN analysis of data set A 

The calculation of the QMN of data set A, comprising 320
haplotypes of West Eurasian provenance, with the
EMPOPall_R11 filter resulted in a complex QMN torso (Figure
2A; see Figure 1B for contrast). A user would be interested
in identifying those branches that cause the complex
structures, as they represent as yet unobserved mutations
that may be erroneous. On mouse-over, the new GUI visually
accentuates linked nodes and branches, while the
complementary substructure is dimmed. Once a subtree of
interest has been identified, individual nodes can be
selected by mouse-click to view all haplotypes that are
included in that node and thus share the difference to the
rCRS indicated by the branch (Supplementary Figure 2). When
evaluating QMNs filtered with EMPOPall, it is recommended
to start by reviewing abundant mutations. For example, the
QMN torso contained 18 haplotypes that shared A366G (nodes
h6 and h20, Figure 2B). The selection of node h6 by
mouse-click resulted in a list of the 17 affected
haplotypes (Supplementary Figure 2). The high abundance of
A366G in various haplogroups (R0, HIS, H2a2b, H6, HV0,
J1c2, and K1a) was surprising and made the respective raw
data worth inspecting. This review clearly indicated the
presence of a phantom mutation at position 366 due to
overlaid sequence electropherograms originating from length
heteroplasmy in the HVS-2 C-tract around position 309
(Supplementary Figure 3). The adenine bases 5' of position
366 were shifted downstream and masked the G signal at
position 366. Additional reverse sequencing reactions would
help to call the correct variant. This first part of the
QMN review already suggested that only single-stranded
sequencing information had been used to generate the
reported consensus haplotypes, which does not meet the
recommendations in forensic genetics (11,12). These
findings are confirmed by other phantom mutations in HVS-2
downstream of the C-tract, eg, C320T and C320G (Figure 2C).
These (and other) positions have previously been reported
as hot spots for phantom mutations (14).
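The recommendation to start with abundant mutations, and the observation that a residual mutation shared by several unrelated haplogroups deserves a look at the raw data, can be sketched as a simple ranking. The function name and the threshold `min_groups` are assumptions made for illustration and are not part of EMPOP NETWORK. The flag is only a prompt for inspecting electropherograms, not evidence of an error by itself.

```python
from collections import defaultdict


def flag_cross_haplogroup(residuals, haplogroups, min_groups=3):
    """Rank residual mutations that occur in several haplogroups.

    residuals:   dict sample id -> set of mutations left after filtering
    haplogroups: dict sample id -> assigned haplogroup
    min_groups:  hypothetical cut-off for reporting a mutation
    """
    carriers = defaultdict(set)
    for sample_id, mutations in residuals.items():
        for mut in mutations:
            carriers[mut].add(sample_id)

    flagged = []
    for mut, samples in carriers.items():
        groups = {haplogroups[s] for s in samples}
        if len(groups) >= min_groups:
            flagged.append((mut, len(samples), sorted(groups)))

    # Most abundant mutations first; ties are broken alphabetically.
    flagged.sort(key=lambda item: (-item[1], item[0]))
    return flagged
```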

QMN analysis of data set B 

Sequencing problems similar to those reported in data set
A were also visible in data set B, an East Asian population
sample including 230 haplotypes (eg, phantom mutation at
position 366, Figure 3). More worrying was the persistent
deletion at position 16038, which was reported in 173
instances (75% of all samples). Selecting the branch that
carries the deletion at 16038 did not change the appearance
of the entire network because of the enormous number of
affected erroneous haplotypes. The sequence raw data
suggested that the analysis suffered from electrophoretic
mobility problems, which is why the two A signals at
positions 16038 and 16039 merged into one single broad peak
(Supplementary Figure 4). Another eye-catching observation
was the frequent occurrence of C311T (n = 106, 46%,
Supplementary Figure 5), which was absent in all EMPOP data
collected so far. In contrast, the expected insertion
315.1C was missing in those cases, suggesting that this
part of the HVS-2 C-tract was not reported in 3'
convention, as laid down in the forensic genetic
recommendations (11).

FIGURE 2. Quasi-median network (QMN) torso of 320 mtDNA
haplotypes from a West Eurasian population sample after
passage through the EMPOPall (Release 11) filter (A). The
complexity of the torso is caused by mutations that were not
observed in EMPOP Release 11. (B) The accentuated subgraph
of the QMN torso. Node h6 was selected by mouse-click. This
node together with node h20 included 18 haplotypes that all
carry mutation A366G. (C) The accentuated subgraph of the
QMN torso selecting node h50 (branch R284C). The linking
branches C320T and C320G represent phantom mutations that
are also caused by length heteroplasmy in the HVS-2 C-tract.

FIGURE 3. Quasi-median network (QMN) torso of 230 mtDNA
haplotypes from an East Asian population sample after
passage through the EMPOPall (Release 11) filter. Phantom
mutation G366A previously discussed for data set A (Figure
2B) is also observed in this data set.
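The proportions quoted for data set B and the pattern that pointed to an alignment problem (C311T reported while the expected 315.1C is absent) can be reproduced with a short sketch. The function names are hypothetical, and the screening rule is a simplification of the reasoning given in the text, not a validated alignment check.

```python
def percent(count, total):
    """Share of count in total, rounded to a whole percent."""
    return round(100 * count / total)


def suspect_alignment(haplotypes, reported="C311T", expected="315.1C"):
    """Return sample ids that list `reported` but lack `expected`.

    haplotypes: dict sample id -> collection of mutation strings
    """
    return sorted(
        sample_id
        for sample_id, mutations in haplotypes.items()
        if reported in mutations and expected not in mutations
    )


# Counts quoted in the text for data set B (230 haplotypes).
print(percent(173, 230))  # deletion at 16038 -> 75
print(percent(106, 230))  # C311T -> 46
```

Samples returned by `suspect_alignment` would be candidates for re-alignment of the HVS-2 C-tract according to the 3' convention.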

DISCUSSION 

The graphical representation of an mtDNA data set as QMN
is a valuable tool for inspecting haplotypes and mutations
that would otherwise be difficult to decipher in a tabular
list. As detailed elsewhere (2,15), recurrent mutations
need to be filtered and haplotypes reduced to the relevant
information to decrease the complexity of the QMN and make
it readable for the human eye. We explicitly note here that
any interpretation of the data quality by QMN can only
refer to those mutations that remain in the reduced data
set. This is why QMN forms only one - albeit important -
part of mtDNA data quality review. QMN analysis can be
performed via the EMPOP website. Based on user feedback,
we here presented an improved and updated version of this
tool and demonstrated its utility using two data sets
submitted to EMPOP for review.

The new network editor software presents features that
considerably improve the power of quasi-median networking
for data quality control. The main advantage is the
possibility to accentuate subgraphs while the remaining
network (complement of the induced subgraph) is dimmed. All
nodes and branches causing the increased complexity become
more clearly visible. For convenient identification of the
corresponding haplotypes, sample identifiers are listed
upon selection of a node with the mouse. The sequence
electropherograms of these samples should be examined with
great scrutiny and appropriate actions taken (eg,
correction of base calls, repetition of sequencing
reactions with alternative primers, etc). Further practical
features of the new network editor GUI are adjustable
drawing and camera settings, with which nodes and branches
can be adapted in color, size, font settings, and others.
Single nodes and branches can be moved to change the
structure and thus the visibility of the graph. Branches
representing identical mutations stay parallel. Supported
export formats include GIF, SVG, and EPS.
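The accentuation of an induced subgraph with dimming of its complement can be sketched on a plain edge list. How the GUI chooses the selected node set on mouse-over is not specified here, so `selected` is simply taken as given. The function names and the opacity value `dim` are assumptions for illustration.

```python
def induced_subgraph(edges, selected):
    """Return the edges whose two end nodes both lie in `selected`.

    edges: list of (node_u, node_v, branch_label) tuples
    """
    return [(u, v, label) for u, v, label in edges if u in selected and v in selected]


def highlight(nodes, edges, selected, dim=0.2):
    """Assign full opacity to the induced subgraph and `dim` to the rest."""
    kept = set(induced_subgraph(edges, selected))
    node_alpha = {n: (1.0 if n in selected else dim) for n in nodes}
    edge_alpha = {e: (1.0 if e in kept else dim) for e in edges}
    return node_alpha, edge_alpha
```

The returned opacity maps could be handed to any drawing routine. A branch with only one end node in the selection belongs to the complement and is therefore dimmed.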

MtDNA data have been quality reviewed with EMPOP NETWORK
since 2006, and since 2010 the journals Forensic Science
International Genetics (11) and International Journal of
Legal Medicine (6) have required authors to have their
mtDNA data quality controlled by EMPOP prior to submission
of the manuscript to the journal. It is our experience that
more than half of the submissions require substantial
changes due to data idiosyncrasies. The forensic community
is particularly sensitive to quality issues. Nevertheless,
several calls for increased quality in forensic genetics
(16,17) have been ignored. With a move to massively
parallel sequencing technologies the problem will likely be
exacerbated (18), as the increased amount of sequence data
likely contains more artifacts than Sanger-type sequence
data, and the application of diverse alignment algorithms
significantly affects the sequence coverage and thus the
resulting consensus sequences (19). Powerful tools for data
review and QC will become indispensable.